Edward Harvey, Yifan Li, Sandani Kumanayake
2023-03-27
Do high college costs mean high pay-off in future earnings?
What about other factors?
Accessed via Kaggle.com. Four files from two sources:
Diversity by college/university for 2014, from the Chronicle of Higher Education (4575 unique values)
Tuition and fees by college/university for 2018-2019, from the Chronicle of Higher Education (2 datasets) (2938 and 3664 unique values)
Salary potential data comes, from payscale.com (934 unique values)
Different time periods.
Not whole population.
Many NA values.
setwd("/Users/sandanikumanayake/Desktop/STAT 697/HW 05/Final/")
#read in diversity dataset
diversity <- read.csv2("diversity_school.csv", sep=",")
#separate total enrollment before pivot, take distinct entries by name and state
total_enrollment <- diversity %>% select(name, state, total_enrollment) %>%
distinct()
#drop total enrollment from diversity
diversity <- diversity %>% select(-total_enrollment)
#convert category and state to factor (not sure if this makes a difference?)
diversity$category <- as.factor(diversity$category)
diversity$state <- as.factor(diversity$state)
#pivot demographic data (saved as new df)
diversity_pivot <- diversity %>%
filter(!is.na(name)) %>%
pivot_wider(id_cols=c(name, state), names_from=category, values_from = enrollment)
#here I rejoin the total enrollment and diversity_pivot is complete
diversity_pivot <- left_join(diversity_pivot, total_enrollment, by=c("name", "state"))
#load in the salary potential data
salary_potential <- read.csv2("salary_potential.csv", sep=",")
salary_potential <- salary_potential %>% rename("state" = "state_name")
#join diversity and salary potential, save as new df
df <- left_join(diversity_pivot, salary_potential, by=c("name", "state"))
#join df with tuition_cost
tuition_cost <- read.csv2("tuition_cost.csv", sep=",")
df <- left_join(df, tuition_cost, by=c("name", "state"))
#drop state code (redundant) and rank(?)
df <- df %>% select(-c("state_code","rank"))
#load tuition income dataset
tuition_income <- read.csv2("tuition_income.csv", sep=",")
#select 2018 only, the most recent data (other datasets do not have multiple years)
tuition_income <- tuition_income %>%
filter(year==2018) %>%
select(name, state, total_price, campus) %>%
distinct()
#convert to factor (is this necessary?)
tuition_income$campus[which(tuition_income$name=="Laurel Technical Institute" & tuition_income$total_price==16168)] <- 'On Campus'
tuition_income$campus <- as.factor(tuition_income$campus)
#pivot wider
tuition_income_pivot <- tuition_income %>% pivot_wider(names_from = campus, values_from = total_price)
#change null values to NA for On Campus
tuition_income_pivot$`On Campus`[sapply(tuition_income_pivot$`On Campus`, is.null)] <- NA
#convert state codes to state
tuition_income_pivot$state <- abbr2state(tuition_income_pivot$state)
#join
df <- left_join(df, tuition_income_pivot, by=c("name", "state"))
colnames(tuition_income_pivot[,3:4]) <- c("on_campus","off_campus")
#save as RDS file
write.csv2(df, file="df.csv")
saveRDS(df, file="df.rds")Some of the more expensive private universities show higher career earnings, but for many this is not the case
Public universities show a more positive and linear relationship between costs and future earnings
Strong positive relationship between % STEM and future earnings
Universities with highest future earnings have a roughly balanced gender mix
Male-dominated schools tend to be technical universities with higher future earnings compared to female-dominated schools